A naive approach to modelling curves

Suppose we have several realisations of a pitch accent in two conditions. For simplicity:

We could try something like this and use lm():

\[ y_i = \beta_0 + \beta_{0, P} + \beta_1 t_i + \beta_{1, P} t_i + \beta_2 t_i^2 + \beta_{2, P} t_i^2 + \beta_3 t_i^3 + \beta_{3, P} t_i^3 + ... + \epsilon_i \] where \(t_i\) is time, the \(\beta_k\) coefficients are shared across conditions, the \(\beta_{k, P}\) coefficients apply only to one condition (labelled \(P\)), and \(\epsilon_i\) is the residual error.

In R it would look like this:

mod <- lm(y ~ Category + time + Category:time + I(time^2) + Category:I(time^2) + I(time^3) + Category:I(time^3), data = curves)

although this is quite a naive approach; see orthogonal polynomials (poly() in R) for a better-conditioned alternative.
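A slightly less naive variant replaces the raw power terms with orthogonal polynomials. A minimal sketch, assuming the same curves data frame (the model name is ours):

```r
# Same cubic model, but with orthogonal polynomial terms via poly();
# the columns produced by poly() are uncorrelated, which stabilises the fit.
# Assumes 'curves' has columns y, time and Category.
mod.poly <- lm(y ~ Category * poly(time, 3), data = curves)
summary(mod.poly)
```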

Problems

  • We don’t know where to stop with power terms
    • but we could use model selection
  • Pure power series are really bad at interpolating general shapes
    • because Taylor’s theorem is about approximating curves near a specific point, not globally
  • How are we going to use the \(\beta\) terms for insight and interpretation?

GAMMs

Slides by Simon Wood

The model looks like:

\[ y_i = f(t_i) + f_P(t_i) + \epsilon_i \] where each \(f()\) is a SPLINE: \[ f(t) = \sum_{k=1}^K \beta_k b_k(t) \] In GAMM lingo, the \(f()\) are called smooths. In simple terms, we can imagine a smooth to be like a polynomial of unspecified degree, or in general a functional term of unspecified complexity, i.e. we do not say how many terms it should include. The GAMM machinery takes care of determining how many terms are strictly needed for our problem.

GAMMs in R

  • Package mgcv is the most popular for GAMMs within R
  • The main command is bam(), equivalent to lmer() in lme4
  • The R formula syntax is quite different from the one used in lme4
  • The model specification with bam() has more possibilities than lmer()
    • fixed and random predictors can be declared as parametric (like with lmer()) or as smooths, or combinations
    • interactions between covariates (= continuous predictors) have different options based on tensor products
  • Interaction between factors (a * b) is not available with smooths
    • the user has to create their own interaction terms manually beforehand (more on this later)

Specifying fixed effects

A single smooth

Equation: \(y_i = \beta_0 + f(t_i) + \epsilon_i\)
R formula: y ~ s(time)

Note that regular smooths like s(time) are centered on zero, hence the intercept term in the equation. The model summary() will show two separate terms:

  • (Intercept), i.e. \(\beta_0\)
  • s(time), i.e. the zero-centered function \(f(t_i)\)

Example:

m1 <- bam(y ~ s(time), data = curves)
summary(m1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(time)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.179931   0.002041   88.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(time) 8.036  8.753 618.2  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.352   Deviance explained = 35.3%
## fREML = -1695.7  Scale est. = 0.041448  n = 9950

Reading the summary

  • Parametric coefficients: the lm-style fixed effects
  • edf: estimated degrees of freedom of smooth terms. More or less, how many coefficients you need to specify the shape of a smooth. The default maximum number of coefficients is k = 10, minus 1 because the smooth is zero-centered and the intercept is specified separately, giving k' = 9
  • fREML: fast restricted maximum likelihood estimation. It cannot be used to compare models with different fixed-effects compositions; for this you need to re-fit your models with ML, see section 4.5.1 in Wieling’s tutorial.
  • Scale est.: residual variance
  • n: number of samples
    • number of curves? missing! (more on this later)
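Since fREML scores are not comparable across models with different fixed effects, such comparisons require refitting with method = "ML". A hedged sketch (model names are ours):

```r
# Re-fit in ML modality to compare fixed-effects structures;
# assumes the 'curves' data frame used throughout.
m0.ml <- bam(y ~ 1, data = curves, method = "ML")
m1.ml <- bam(y ~ s(time), data = curves, method = "ML")
itsadug::compareML(m0.ml, m1.ml)  # chi-square test on the ML scores
```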

Plotting

  • Plotting is essential!
  • Make sure you know what you are plotting (see itsadug docs):
    • partial effects
    • summed effects

Partial effects are smooths, which are zero-centered in their default mode. This means that they represent the ‘shape part’ of the effect, i.e. the total effect minus the mean across the time axis (or whatever continuous variable you are using).

Smooths are (implicitly) numbered by their order of appearance in the summary() output. To plot a selected smooth, here are two options (among many):

First option: use mgcv’s plot method for GAM objects (plot.gam):

plot(m1, select = 1, shade = TRUE, rug = FALSE)
abline(h=0, col='red', lty=2)

Second option: extract the fitted data from the model, then plot it with ggplot or plotfunctions::plot_error.

smooth1 <- get_modelterm(m1, select = 1)
## Summary:
##  * time : numeric predictor; with 30 values ranging from 0.000000 to 2.000000.
smooth1$time
##  [1] 0.00000000 0.06896552 0.13793103 0.20689655 0.27586207 0.34482759
##  [7] 0.41379310 0.48275862 0.55172414 0.62068966 0.68965517 0.75862069
## [13] 0.82758621 0.89655172 0.96551724 1.03448276 1.10344828 1.17241379
## [19] 1.24137931 1.31034483 1.37931034 1.44827586 1.51724138 1.58620690
## [25] 1.65517241 1.72413793 1.79310345 1.86206897 1.93103448 2.00000000
smooth1$fit
##  [1]  0.007429533  0.048030083  0.087817603  0.125320472  0.158851066
##  [6]  0.186758645  0.207393690  0.219023389  0.219971889  0.209074531
## [11]  0.186265159  0.152961379  0.111990148  0.067048716  0.021942368
## [16] -0.020085368 -0.056739554 -0.086789705 -0.110041611 -0.127247709
## [21] -0.139910995 -0.149879879 -0.158736161 -0.167185194 -0.174788701
## [26] -0.180282388 -0.182399062 -0.180761816 -0.176267289 -0.170565204
smooth1$se.fit
##  [1] 0.02168769 0.01289966 0.01052936 0.01144544 0.01072990 0.01021707
##  [7] 0.01085440 0.01073402 0.01029606 0.01067955 0.01077297 0.01034843
## [13] 0.01054433 0.01080848 0.01045164 0.01045726 0.01082012 0.01055427
## [19] 0.01035298 0.01077257 0.01067386 0.01028435 0.01072127 0.01084525
## [25] 0.01020879 0.01071573 0.01142727 0.01052660 0.01294317 0.02176024
ggplot(smooth1) +
  aes(x = time, y = fit) +
  geom_line() +
  geom_ribbon(mapping = aes(ymin = fit - se.fit, ymax = fit + se.fit), fill = "red", alpha = 0.5) +
  theme_light() +
  theme(text = element_text(size = 15))

Summed effects include smooths and intercept (parametric) terms that pertain to an effect. In our case it means \(\beta_0 + f(t_i)\).

The easy way to plot summed effects is to use itsadug::plot_smooth:

plot_smooth(m1, view = "time", rug = FALSE)
## Summary:
##  * time : numeric predictor; with 30 values ranging from 0.000000 to 2.000000. 
##  * NOTE : No random effects in the model to cancel.
## 

One could also get all the relevant values and sum them manually. In our case, get samples from the smooth \(f(t_i)\) using get_modelterm as above, then add the value of \(\beta_0\) taken from the summary (i.e. m1$coefficients["(Intercept)"]).
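A minimal sketch of that manual computation (variable names are ours):

```r
# Summed effect beta_0 + f(t_i), computed by hand:
smooth1 <- get_modelterm(m1, select = 1, print.summary = FALSE)
summed.fit <- smooth1$fit + m1$coefficients["(Intercept)"]
plot(smooth1$time, summed.fit, type = "l", xlab = "time", ylab = "summed effect")
```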

Question: the confidence intervals are quite thin in comparison with the impression of noisy signals from the raw data plot. Why is that?

Diagnostics

Degrees of freedom

The specification of smooths s() has an extra argument k, whose default value is 10. k indicates the maximum number of basis splines allowed to build the smooth, or in other words the maximum number of parameters or degrees of freedom. As regular smooths are zero-centered, a smooth with at most k = 10 parameters actually has at most k' = 9 degrees of freedom, as one is absorbed by the intercept term. The estimated degrees of freedom edf indicate how many ‘equivalent parameters’ were actually used in the estimated smooth. When edf is near 1, the smooth is basically a straight line (only the slope parameter), in which case we would rather remove the smooth and use an ordinary linear term. Conversely, when edf is very close to k' = k - 1, as in this case, the smooth has used up all its degrees of freedom, so perhaps more are needed.

Let’s specify a higher k and see what happens to edf for s(time):

m1.k20 <- bam(y ~ s(time, k = 20), data = curves)
summary(m1.k20)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(time, k = 20)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.179931   0.002041   88.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df   F p-value    
## s(time) 10.08  12.33 439  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.352   Deviance explained = 35.3%
## fREML = -1695.4  Scale est. = 0.041444  n = 9950

edf has not increased much, and it is well below k - 1 = 19, so the default limit of k = 10 was adequate.

Residuals

gam.check(m1)

## 
## Method: fREML   Optimizer: perf newton
## full convergence after 10 iterations.
## Gradient range [-6.91216e-11,6.825118e-11]
## (score -1695.72 & scale 0.04144809).
## Hessian positive definite, eigenvalue range [3.749254,4974.002].
## Model rank =  10 / 10 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##           k'  edf k-index p-value
## s(time) 9.00 8.04    1.02    0.95
  • Quantile plot and histogram indicate how close to Gaussian the distribution of residuals is, i.e. how close we are to the hypothesis \(\epsilon \sim N(0, \sigma)\).
  • Residuals vs. linear predictor shows whether trends are visible in the residuals – which they shouldn’t. For example, if residuals have a wider span for some values of the predicted output it means that the data are heteroscedastic, i.e. the variance of the residuals is not independent of predicted output, but it varies with it. Sometimes, especially with time series, this problem is mitigated by using a different distribution for the residuals, the so-called scaled-t. But often it also means that the model is ill-specified, and a different predictor structure is needed.
  • Response vs. Fitted Values: for Gaussian residuals, this provides a similar picture as Residuals vs. linear predictor. The difference is that the y axis reports the actual output values (y in the example) rather than the residual errors.

A fixed factor interacting with a smooth

Equation: \(y_i = \beta_0 + \beta_P + f_{NP}(t_i) + f_P(t_i) + \epsilon_i\)
R formula: y ~ Category + s(time, by = Category)

where \(P\) indicates the PEAK category, while the default (control) level is NO_PEAK (\(NP\)), and \(i\) is the sample (not curve) index. The equation above omits the indicator functions; the complete version is: \[ y_i = \beta_0 + \beta_P \cdot I(i \in P) + f_{NP}(t_i) \cdot I(i \in NP) + f_P(t_i) \cdot I(i \in P) + \epsilon_i \] Note the asymmetry in the conventions for the parametric and the smooth parts of the model:

  • Parametric part: as in lm(), \(\beta_0\) is the intercept, \(\beta_P\) is the difference between level \(P\) and the intercept
  • Smooths: one for each level of the fixed factor, all zero-centered

Question: What are the summed effects for the PEAK and the NO_PEAK levels (in maths notation)?

Estimate the model with bam():

curves %<>% mutate(Category = factor(Category)) # don't forget!
m2 <- bam(y ~ Category + s(time, by = Category), data = curves)
summary(m2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ Category + s(time, by = Category)
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1796755  0.0002261  794.78   <2e-16 ***
## CategoryPEAK 0.0242948  0.0003197   75.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                           edf Ref.df     F p-value    
## s(time):CategoryNO_PEAK 8.980      9 48196  <2e-16 ***
## s(time):CategoryPEAK    8.998      9 37123  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.975   Deviance explained = 97.5%
## fREML = -47131  Scale est. = 0.00050838  n = 19900

Practical tip: all variables that need to be used as fixed or random factors (here Category) must be converted to R class factor, otherwise bam will throw an error.

Question: is PEAK significantly different from NO_PEAK?

We can read off the summary output the statistical significance of parametric terms in the same way as with lm. In geometrical terms, we can say whether the average level of y along the time axis is different for PEAK and NO_PEAK curves.

With the model expressed as above, we cannot use the significance levels in the summary to answer questions about the difference between smooths. A small p-value only says that the smooth is significantly different from zero, which means that its shape is different from the flat line y = 0. If both PEAK and NO_PEAK smooths appear with a small p-value, it means that their shapes are not flat, i.e. in both cases y varies non-linearly with time, but nothing can be said about whether their respective shapes are similar or not.

Let’s plot first. A look at the summed effects:

plot_smooth(m2, view = "time", plot_all = "Category", col = Category.colors)
## Summary:
##  * Category : factor; set to the value(s): NO_PEAK, PEAK. 
##  * time : numeric predictor; with 30 values ranging from 0.000000 to 2.000000. 
##  * NOTE : No random effects in the model to cancel.
## 

A graphical way to detect statistical differences between effects is:

plot_diff(m2, view = "time", comp = list(Category = c("PEAK", "NO_PEAK")))
## Summary:
##  * time : numeric predictor; with 100 values ranging from 0.000000 to 2.000000. 
##  * NOTE : No random effects in the model to cancel.
## 

## 
## time window(s) of significant difference(s):
##  0.585859 - 0.747475
##  0.828283 - 0.989899
##  1.050505 - 1.191919
##  1.232323 - 1.919192
##  1.959596 - 2.000000

Testing for significance: emmeans

The powerful emmeans function can be used on mgcv GAMMs as a way to test pointwise comparisons. Suppose we want to test whether PEAK and NO_PEAK differ at time equal to 0.0, 1.0, and 2.0. We can use emmeans in the same way as we do with lm or lmer, and we will get the correction for multiple comparisons.

emmeans(m2, pairwise ~ Category | time, at = list(time = 0:2))
## $emmeans
## time = 0:
##  Category   emmean       SE    df lower.CL upper.CL
##  NO_PEAK   0.18448 0.001316 19880  0.18190  0.18706
##  PEAK      0.18204 0.001318 19880  0.17945  0.18462
## 
## time = 1:
##  Category   emmean       SE    df lower.CL upper.CL
##  NO_PEAK   0.18333 0.000655 19880  0.18205  0.18462
##  PEAK      0.18496 0.000656 19880  0.18368  0.18625
## 
## time = 2:
##  Category   emmean       SE    df lower.CL upper.CL
##  NO_PEAK  -0.00104 0.001319 19880 -0.00363  0.00154
##  PEAK      0.03245 0.001316 19880  0.02987  0.03503
## 
## Confidence level used: 0.95 
## 
## $contrasts
## time = 0:
##  contrast       estimate       SE    df t.ratio p.value
##  NO_PEAK - PEAK  0.00244 0.001862 19880   1.313  0.1892
## 
## time = 1:
##  contrast       estimate       SE    df t.ratio p.value
##  NO_PEAK - PEAK -0.00163 0.000927 19880  -1.758  0.0788
## 
## time = 2:
##  contrast       estimate       SE    df t.ratio p.value
##  NO_PEAK - PEAK -0.03349 0.001863 19880 -17.978  <.0001

Testing for significance: binary smooths

While emmeans allows us to test differences at specific points on the time axis (or other continuous dimensions), we also need a way to assess the significance of a smooth contrast as a whole. For this purpose, we use another way to specify a smooth interacting with a fixed factor in mgcv::bam: the so-called binary smooths.

A binary smooth is a smooth that represents the difference between effects (including intercepts). We use them to model the difference between two levels of a factor, or between more complex combinations of factor level interactions (more on this later). Once we are able to introduce such a term in the model, we obtain two things:

  1. we can assess the statistical significance of the difference between a pair of levels of interest by reading it off the model summary
  2. we can plot such difference and interpret it

In order to set up a convenient binary smooth with mgcv, we first need to define numeric variables with the meaning of logical variables (yes, you read that right), which represent the desired contrasts (a rather obscure convention adopted by the mgcv library). In our case we want to represent the difference between PEAK and NO_PEAK, with NO_PEAK as the control level:

curves$IsPEAK <- (curves$Category == "PEAK") * 1

where * 1 is a quick way to transform the logical value inside () into a numeric one. The R formula and the corresponding maths form are:

Equation: \(y_i = \beta_0 + f_0(t_i) + b_P(t_i) \cdot I(i \in P) + \epsilon_i\)
R formula: y ~ s(time) + s(time, by = IsPEAK)

where:

  • \(\beta_0\) is the global intercept
  • \(f_0(t_i)\) is the global, zero-centered smooth
  • \(b_P(t_i)\) is the binary smooth, which is not zero-centered. It represents the total difference between PEAK and NO_PEAK

m2.bin <- bam(y ~ s(time) + s(time, by = IsPEAK), data = curves)
summary(m2.bin)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(time) + s(time, by = IsPEAK)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.1796561  0.0002261   794.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df     F p-value    
## s(time)        8.980      9 48195  <2e-16 ***
## s(time):IsPEAK 9.995     10  3227  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.975   Deviance explained = 97.5%
## fREML = -47132  Scale est. = 0.00050838  n = 19900
The terms map onto the summary output as follows:

  • \(\beta_0\): (Intercept)
  • \(f_0(t_i)\): s(time)
  • \(b_P(t_i)\): s(time):IsPEAK

Only now are we able to report the significance of the difference between PEAK and NO_PEAK, by reading off the summary output for s(time):IsPEAK.

How does the difference look?

get_modelterm(m2.bin, select = 2) %>%
  ggplot() +
  aes(x = time, y = fit) +
  geom_line() +
  geom_ribbon(mapping = aes(ymin = fit - se.fit, ymax = fit + se.fit), fill = "red", alpha = 0.5) +
  theme_light() +
  theme(text = element_text(size = 15))
## Summary:
##  * time : numeric predictor; with 30 values ranging from 0.000000 to 2.000000. 
##  * IsPEAK : numeric predictor; set to the value(s): 1.

There is a third modality for smooths, the so-called ordered factor difference smooth. It allows separating the intercept difference from the shape difference, i.e. it splits the \(b_P(t_i)\) term into a constant and a zero-centered smooth. See section 4.5.3 in Wieling’s tutorial for details.

Two interacting covariates – smooth surfaces

Suppose we want to model the effect of trial, i.e. the shape of the curve y(time) changes not only according to a category (PEAK, NO_PEAK) but also along a numeric variable trial encoding at which point in the experiment we are. The data look like this:

This is a case of a two-dimensional, nonlinear interaction between two covariates, time (\(t\)) and trial \((r)\). The default way to model it is a tensor product smooth.

Equation: \(y_i = \beta_0 + \beta_P + f_{NP}(t_i, r_i) + f_P(t_i, r_i) + \epsilon_i\)
R formula: y ~ Category + te(time, trial, by = Category)

Each of the tensor functions \(f\) is the generalisation of a complete interaction expression in lm:

Maths: \(\beta_1 \cdot t_i + \beta_2 \cdot r_i + \beta_3 \cdot t_i \cdot r_i\)
R formula in lm(): time * trial
m3 <- bam(y ~ Category + te(time, trial, by = Category), data = curves)
summary(m3)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ Category + te(time, trial, by = Category)
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1158235  0.0003097  373.98   <2e-16 ***
## CategoryPEAK 0.0242490  0.0004384   55.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                  edf Ref.df     F p-value    
## te(time,trial):CategoryNO_PEAK  9.00  9.001 29164  <2e-16 ***
## te(time,trial):CategoryPEAK    10.98 12.318 16825  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.959   Deviance explained =   96%
## fREML = -40952  Scale est. = 0.00094544  n = 19900

The surface plot for the PEAK and NO_PEAK conditions and slices along trial == 20 and trial == 80 (all summed effects) are:

If we were using lm instead of bam we would also be able to produce similar surface visualisations, but in this case with pretty bad estimates:

m3.lm <- lm(y ~ Category * time * trial, data = curves)
summary(m3.lm)
## 
## Call:
## lm(formula = y ~ Category * time * trial, data = curves)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.258196 -0.066879  0.007525  0.072951  0.217637 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.014e-01  3.277e-03 122.471   <2e-16 ***
## CategoryPEAK            -1.182e-02  4.843e-03  -2.440   0.0147 *  
## time                    -2.223e-01  2.837e-03 -78.374   <2e-16 ***
## trial                   -1.318e-03  6.062e-05 -21.747   <2e-16 ***
## CategoryPEAK:time        3.526e-02  4.191e-03   8.414   <2e-16 ***
## CategoryPEAK:trial      -9.619e-06  8.605e-05  -0.112   0.9110    
## time:trial              -2.282e-06  5.248e-05  -0.043   0.9653    
## CategoryPEAK:time:trial  3.315e-05  7.444e-05   0.445   0.6561    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08783 on 19892 degrees of freedom
## Multiple R-squared:  0.6696, Adjusted R-squared:  0.6695 
## F-statistic:  5760 on 7 and 19892 DF,  p-value: < 2.2e-16

The surfaces themselves are non-linear, i.e. they are not flat, because of the product term \(\beta_3 \cdot t_i \cdot r_i\), but each slice along either time or trial is a straight line (this is called a ruled surface).

To test for significant differences between PEAK and NO_PEAK, we can use binary smooth surfaces as above:

curves$IsPEAK <- (curves$Category == "PEAK") * 1
m3.bin <- bam(y ~ te(time, trial) + te(time, trial, by = IsPEAK), data = curves)
summary(m3.bin)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ te(time, trial) + te(time, trial, by = IsPEAK)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.1158236  0.0003097     374   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                          edf Ref.df       F p-value    
## te(time,trial)         8.999   9.00 29166.9  <2e-16 ***
## te(time,trial):IsPEAK 11.985  13.33   972.2  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.959   Deviance explained =   96%
## fREML = -40957  Scale est. = 0.00094544  n = 19900
pvisgam(m3.bin, view = c("time", "trial"), select = 2, color = c('white', 'blue'), main = "PEAK - NO_PEAK", print.summary = FALSE)

The difference PEAK - NO_PEAK is thus significant, but the surface plot suggests that this difference does not change much along trial. To disentangle this, another form of tensor product is available, ti(), which isolates the interaction:

Equation: \[ y_i = \beta_0 + \beta_P + f_{1, NP}(t_i) + f_{1, P}(t_i) + f_{2, NP}(r_i) + f_{2, P}(r_i) + f_{3, NP}(t_i, r_i) + f_{3, P}(t_i, r_i) + \epsilon_i \]
R formula: y ~ Category + s(time, by = Category) + s(trial, by = Category) + ti(time, trial, by = Category)
m3.ti <- bam(y ~ Category + s(time, by = Category) + s(trial, by = Category) + ti(time, trial, by = Category), data = curves)
summary(m3.ti)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ Category + s(time, by = Category) + s(trial, by = Category) + 
##     ti(time, trial, by = Category)
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1160613  0.0002272  510.84   <2e-16 ***
## CategoryPEAK 0.0243985  0.0003261   74.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                  edf Ref.df     F p-value    
## s(time):CategoryNO_PEAK        8.987  9.000 49023  <2e-16 ***
## s(time):CategoryPEAK           8.998  9.000 37892  <2e-16 ***
## s(trial):CategoryNO_PEAK       1.000  1.000 28275  <2e-16 ***
## s(trial):CategoryPEAK          3.402  4.211  6444  <2e-16 ***
## ti(time,trial):CategoryNO_PEAK 3.999  4.000  4417  <2e-16 ***
## ti(time,trial):CategoryPEAK    8.654 10.451  1642  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.978   Deviance explained = 97.8%
## fREML = -47066  Scale est. = 0.00050877  n = 19900

In this case the non-linear interaction between time and trial appears to be significant (as there is little noise in the data), though its effect is minimal, judging from the contour levels.

Another simplification of the model, apparent from visual inspection, would be to make the smooth over trial a linear term independent of Category, i.e. substitute s(trial, by = Category) with just trial, or in maths notation \(\beta_2 \cdot r_i\). That the term is linear is suggested by its shape and by the edf values of the respective smooths.
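A sketch of this simplified specification (the model name is ours; results not shown):

```r
# Replace the Category-specific smooths over trial with a single linear term,
# as suggested by the near-1 edf and the straight shape of s(trial).
m3.lin <- bam(y ~ Category + trial + s(time, by = Category) +
                ti(time, trial, by = Category), data = curves)
summary(m3.lin)
```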

Random effects

Just like we moved from lm() to lmer(), we can build mixed-effects GAMs, or GAMMs. Random effects have the same meaning with GAMMs as with LMERs. The difference is that we can model variations in shape, i.e. random smooths.

The mgcv library allows for three different sorts of random terms, which in practice allow for three degrees of complexity:

  1. Random intercepts: only a constant
  2. Random slopes: a constant plus a slope (tilt)
  3. Random smooths: arbitrary shape (within the limit set by the number of splines k)

This is best illustrated in Fig. 6 at p. 10 of this paper by van Rij et al.

Suppose we have curves along the time axis, the fixed factor Category as before, and subject (subjId) as random factor. With mgcv, random intercepts are indicated as s(subjId, bs = "re"). This means that subject-specific effects only vary the global height of the Category-specific curves, i.e. they can shift those curves up or down but do not change their shape. This is the equivalent of (1 | subjId) in lmer notation.

Random slopes… it depends. These are a source of great confusion. If the slope is with respect to a covariate, say time, then slope means slope in the common sense, i.e. a tilted line. But just as with lmer, random slopes can also be specified with respect to a fixed factor, say Category, which behaves in the same manner as with lmer, i.e. the ‘slope’ is the difference between the intercept and the specific level, e.g. PEAK - NO_PEAK.

A random slope with respect to time would be: s(subjId, t, bs = "re"), while a random slope with respect to a fixed factor like Category would be: s(subjId, Category, bs = "re"). The first one is the one represented in the figure from the paper by van Rij. A random slope is equivalent to (0 + t | subjId) in lmer notation. Note the explicit absence of the intercept. This is because, unlike in lmer, there is no way to estimate correlations between intercept and slope. If we want to model both random intercept and slope, we need to include both terms in the model, i.e. s(subjId, bs = "re") + s(subjId, t, bs = "re"), and yet the GAMM will not estimate the correlation between the two.

Finally, a random smooth is indicated as: s(t, subjId, bs = "fs"). Note that the first term indicates the covariate, the second the random factor.

Each of the three settings can be made more complex by indicating a by factor, which behaves according to the conventions used in mgcv, i.e. an independent term is added for each level of the by factor. For example, a random intercept can be split by Category with: s(subjId, by = Category, bs = "re"), which adds two random intercepts, one for PEAK, one for NO_PEAK. This achieves the same effect as s(subjId, Category, bs = "re").

A random slope could be: s(subjId, t, by = Category, bs = "re"), where the slope is with respect to t, while Category simply splits the term in two as before.

In the example below we use a random smooth split with respect to Category: s(t, subjId, by = Category, bs = "fs").
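To recap the formula equivalences discussed above (a sketch; t is the covariate, subjId the random factor, as in the text):

```r
# lmer notation        ->  mgcv notation
# (1 | subjId)         ->  s(subjId, bs = "re")               # random intercept
# (0 + t | subjId)     ->  s(subjId, t, bs = "re")            # random slope over t
# intercept + slope    ->  s(subjId, bs = "re") + s(subjId, t, bs = "re")
#                          (no intercept-slope correlation is estimated)
# random smooth        ->  s(t, subjId, bs = "fs", m = 1)
# split by a factor    ->  add by = Category to any of the above
```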

A case with subject-specific shape variations

A variation of the previous data set is depicted as follows:

The data set contains several y curves for each subject (a subset of subjects and trials in the plot above). Note that:

  • All subjects globally behave similarly with respect to Category
  • Each subject has their own departure from the general trend
  • The variation expressed on the first larger peak is consistent across Category levels within the same subject
  • In addition, there is also a consistent, subject-specific departure from the mean in the second peak in PEAK condition, though not obviously related to the one of the first peak

We model all this by introducing random smooth terms. Such terms vary along time, are subject-specific, i.e. they are added to the fixed effects for each subject, and come in two versions, one per Category level.

The model in mgcv is expressed as follows (we are not modelling the effect of trial):

m4 <- bam(y ~ Category + s(t, by = Category) +
            s(t, subjId, by = Category, bs = "fs", m = 1), data = curves)
summary(m4)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ Category + s(t, by = Category) + s(t, subjId, by = Category, 
##     bs = "fs", m = 1)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.509038   0.008256  61.653  < 2e-16 ***
## CategoryPEAK 0.072021   0.012002   6.001 1.99e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                 edf  Ref.df      F p-value    
## s(t):CategoryNO_PEAK          8.926   8.936 106.89  <2e-16 ***
## s(t):CategoryPEAK             8.932   8.940 163.18  <2e-16 ***
## s(t,subjId):CategoryNO_PEAK 344.434 359.000  64.89  <2e-16 ***
## s(t,subjId):CategoryPEAK    346.545 359.000  68.53  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.88   Deviance explained = 88.2%
## fREML =  11644  Scale est. = 0.10527   n = 33600

The m = 1 argument is usually added to random smooths. It penalises the first derivative rather than the second, which increases the non-linearity penalty, i.e. it imposes a stronger regularisation.

Let’s compare it with a version without the random terms.

m4.fixed <- bam(y ~ Category + s(t, by = Category), data = curves)
summary(m4.fixed)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ Category + s(t, by = Category)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.507969   0.003900   130.2   <2e-16 ***
## CategoryPEAK 0.072263   0.005516    13.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                        edf Ref.df    F p-value    
## s(t):CategoryNO_PEAK 8.991      9 4718  <2e-16 ***
## s(t):CategoryPEAK    8.992      9 4342  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.709   Deviance explained = 70.9%
## fREML =  24830  Scale est. = 0.25542   n = 33600

Note how R-sq and deviance explained are way higher for the model with random terms. In this case we can also directly affirm that the random terms are justified, as these are significant and they are the only difference between the two models.

The summed fixed effects look as expected:

plot_smooth(m4, view = "t", plot_all = "Category", col = Category.colors, print.summary = FALSE)

It is interesting to compare the result with the model without random terms:

plot_smooth(m4.fixed, view = "t", plot_all = "Category", col = Category.colors, print.summary = FALSE)

Note how ‘overconfident’ the simpler model is!

Let’s visualise the shape of the random terms, picking out the first 5 subjects, to be compared with the raw data above:

inspect_random(m4, select = 3, cond=list(subjId = 1:5 %>% factor(), Category = "NO_PEAK"), main = "NO_PEAK", col = colorblind_pal()(5), lty = 1, lwd = 2, print.summary = FALSE)
inspect_random(m4, select = 4, cond=list(subjId = 1:5 %>% factor(), Category = "PEAK"), main = "PEAK", col = colorblind_pal()(5), lty = 1, lwd = 2, print.summary = FALSE)

Question: Can you find, by eye inspection, the subject-specific consistency in the change of shape of the first peak across Category levels in the plots above (comparing with the raw data further above)?

Question: Does the GAMM model such consistency?

What is special about time-varying signals?

  • What is the difference between time-series and other types of functional data?
  • When we provide information (to the model) about the grouping of smooths induced by having different subjects, is there any other relationship in the data we would like to represent?

We would like to indicate which \((t_i, y_i)\) samples belong to the same curve! There are two ways of doing it:

  • Define another (nested) level of random smooth, call it the token
  • Alternatively, enrich \(\epsilon\) with an AR1 model to express correlation of adjacent samples
    • \(\epsilon_i = \rho \epsilon_{i-1} + \psi_i\)
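A hedged sketch of the AR1 route with bam (the token column identifying individual curves is hypothetical; see itsadug::start_event):

```r
# bam() supports an AR1 residual model via rho and AR.start, which needs a
# logical vector marking the first sample of each curve.
curves <- itsadug::start_event(curves, column = "time", event = "token")
rho.est <- itsadug::start_value_rho(m1)  # rho estimated from the AR-free model
m1.ar1 <- bam(y ~ s(time), data = curves,
              rho = rho.est, AR.start = curves$start.event)
```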

Practical advice

  • Convert all character variables to factor
  • Convert binary variables to integer
  • GAMMs with random smooths are very slow: use a server rather than your laptop, and use the nthreads option of bam()
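For example (a sketch; the model is illustrative and the thread count depends on your machine):

```r
# Speed up estimation with parallel threads and discretised covariates.
m.fast <- bam(y ~ Category + s(time, by = Category) +
                s(time, subjId, by = Category, bs = "fs", m = 1),
              data = curves, nthreads = 4, discrete = TRUE)
```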